Predict which (known) terrorist organization is behind a given attack.

Following is a simple datascience framework followed to carefully understand our data and build models

Problem Statement

  • Predict which (known) terrorist organization is behind a given attack.

Data Extraction

  • The data has been extracted from Global Terrorism Database (GTD), a database maintained by the National Consortium for the Study of Terrorism (START) at the University of Maryland, College Park.
  • I used the version published on Kaggle in early 2017 for this project, which covers terrorist attacks happened between 1970 and 2015. (This database has been updated in July 2017 to include incidents in 2016.)

  • This data includes ~170K attacks of which Taliban was responsible for 6575 of the attacks (the highest)

In [1]:
#### Importing packages

#Data manipulation
import pandas as pd
import seaborn as sns
import numpy as np

#visualisations
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 8)
matplotlib.style.use('ggplot')
from plotnine import *

#Models
import xlrd
import xgboost as xgb
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
#from sklearn.externals import joblib
from sklearn.model_selection import train_test_split
#from sklearn.manifold import TSNE
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.preprocessing import StandardScaler

#Other
import warnings
warnings.filterwarnings('ignore')
import itertools

#Read the data
events = pd.read_csv('data.csv', index_col='eventid')
events.head()
/media/Data/anaconda2/lib/python2.7/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools
/media/Data/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
Out[1]:
iyear imonth iday approxdate extended resolution country country_txt region region_txt ... addnotes scite1 scite2 scite3 dbsource INT_LOG INT_IDEO INT_MISC INT_ANY related
eventid
197000000001 1970 7 2 NaN 0 NaN 58 Dominican Republic 2 Central America & Caribbean ... NaN NaN NaN NaN PGIS 0 0 0 0 NaN
197000000002 1970 0 0 NaN 0 NaN 130 Mexico 1 North America ... NaN NaN NaN NaN PGIS 0 1 1 1 NaN
197001000001 1970 1 0 NaN 0 NaN 160 Philippines 5 Southeast Asia ... NaN NaN NaN NaN PGIS -9 -9 1 1 NaN
197001000002 1970 1 0 NaN 0 NaN 78 Greece 8 Western Europe ... NaN NaN NaN NaN PGIS -9 -9 1 1 NaN
197001000003 1970 1 0 NaN 0 NaN 101 Japan 4 East Asia ... NaN NaN NaN NaN PGIS -9 -9 1 1 NaN

5 rows × 134 columns

Variable description

Below is a table that summarises all the variables in the dataset into different broad categories. We'll discuss more about each variable as we proceed with our analysis.

Variable Categories Variable Names
GTD ID and Date eventid, iyear, imonth, iday, extended
Attack Information summary, doubtterr, multiple, related
Attack Location country_txt, region_txt, provstate, latitude, longitude
Attack type attacktype1_txt, attacktype2_txt, attacktype3_txt, success, suicide
Weapon Information weaptype1_txt, weaptype2_txt, weaptype3_txt, weaptype4_txt
Target/Victim Information target1, targtype1_txt, natlty1_txt, target2, targtype2_txt, natlty2_txt, target3, targtype3_txt, natlty3_txt
Perpetrator Information gname, gname2, gname3, guncertain1, guncertain2, guncertain3, nperps, nperpcap, claimed, compclaim
Casualties and deaths nkill, nkillter, nwound, nwoundte, property, propextent, ishostkid, nhostkid
Additional variables INT_LOG, INT_IDEO, INT_MISC, INT_ANY, 'specificity', 'individual', 'dbsource'

Data preparation

  • Deal with categorical missing data
  • Remove data with "unknown" target groups
  • Subset the data where doubtterr == 0
  • Remove columns that have a missing value > 45%
  • Remove unwanted columns
  • Subset data to those target groups that are responsible for atleast 500 attacks
  • Impute missng values

Data preparation : Deal with categorical missing data

In [2]:
#Replace NA's of a few categorical variables with "Unknown"
def do():
    events['provstate'].fillna('Unknown', inplace=True)
    events['country'].fillna('Unknown', inplace=True)
    events['gname'].fillna('Unknown', inplace=True)
    events['corp1'].fillna('Unknown',inplace = True)

do()

Data preparation : Remove data with "unknown" target groups

In [3]:
# Remove unknown target variables

e2 = events[events['gname']!='Unknown']
print("Shape of the data after removing unknown gnames")
print e2.shape 
Shape of the data after removing unknown gnames
(92044, 134)

Data preparation : Subset the data where doubtterr == 0

In [4]:
# Doubter is a variable that tells you whether there is a doubt as to an incident being an act of terrorism
# 1 = "Yes there is a doubt"; 0 = "No there is no doubt"
# Subset data with doubtter=0
e2 = e2[e2['doubtterr']==0]
print("Shape of the data after removing doubtful terrorist attacks")
print e2.shape 
Shape of the data after removing doubtful terrorist attacks
(72098, 134)

Data preparation : Remove columns that have a missing value > 45%

In [5]:
#Removing columns that have a missing value > 45%

def remove_missing():
    missing_freq = e2.isnull().sum()
    missing_freq_df = pd.DataFrame({'column':missing_freq.index,'no_missing':missing_freq.values})
    missing_freq_df = missing_freq_df.sort_values(by='no_missing',ascending = False)

    missing_freq_df['per'] = (missing_freq_df['no_missing']/92044)*100
    #print missing_freq_df[missing_freq_df.per<45].shape

    missing_freq_df_45 = missing_freq_df[missing_freq_df.per<45]
    missing_subset_list=list(missing_freq_df_45.column)
    e3 = e2[missing_subset_list]
    print "Function successful"
    return e3

e3 = remove_missing()

print("Shape of the data after removing missing value columns")
print e3.shape
print e3.columns
Function successful
Shape of the data after removing missing value columns
(72098, 58)
Index([u'ransom', u'nperpcap', u'nwoundte', u'nperps', u'scite1', u'summary',
       u'claimed', u'nkillter', u'nwoundus', u'nkillus', u'weapdetail',
       u'weapsubtype1', u'weapsubtype1_txt', u'nwound', u'nkill',
       u'targsubtype1', u'targsubtype1_txt', u'latitude', u'longitude',
       u'natlty1', u'natlty1_txt', u'target1', u'guncertain1', u'ishostkid',
       u'city', u'specificity', u'INT_MISC', u'INT_ANY', u'INT_LOG',
       u'dbsource', u'INT_IDEO', u'iyear', u'property', u'weaptype1_txt',
       u'iday', u'extended', u'country', u'country_txt', u'region',
       u'region_txt', u'provstate', u'vicinity', u'crit1', u'crit2', u'crit3',
       u'doubtterr', u'multiple', u'success', u'suicide', u'attacktype1',
       u'attacktype1_txt', u'targtype1', u'targtype1_txt', u'corp1', u'gname',
       u'imonth', u'weaptype1', u'individual'],
      dtype='object')

Data preparation : Remove other unwanted columns

In [6]:
#Remove other unwanted columns
def drop_columns(): 
    #Remove text columns
    text_drop = ['weapdetail', 'weapsubtype1_txt', 'targsubtype1_txt','weaptype1_txt','region_txt','country_txt',
            'attacktype1_txt','targtype1_txt','target1','natlty1_txt','city','scite1','summary']
    e3.drop(text_drop, inplace=True, axis=1)
    
    #Remove other columns
    other_drop = ['latitude', 'longitude', 'specificity', 'individual', 'dbsource','iday','corp1',
                  'guncertain1','INT_MISC','INT_LOG', 'INT_IDEO','nperpcap','nwoundus','claimed','nkillus','doubtterr']
    e3.drop(other_drop,inplace=True,axis=1)
    print "Function successful"
    return e3

e3 = drop_columns()
print e3.columns
Function successful
Index([u'ransom', u'nwoundte', u'nperps', u'nkillter', u'weapsubtype1',
       u'nwound', u'nkill', u'targsubtype1', u'natlty1', u'ishostkid',
       u'INT_ANY', u'iyear', u'property', u'extended', u'country', u'region',
       u'provstate', u'vicinity', u'crit1', u'crit2', u'crit3', u'multiple',
       u'success', u'suicide', u'attacktype1', u'targtype1', u'gname',
       u'imonth', u'weaptype1'],
      dtype='object')

Data preparation : Subset data to those target groups that are responsible for atleast 500 attacks

In [7]:
#Removing groups which have less than 500 attacks
def remove_lt_500():
    groups = e3['gname'].value_counts().to_dict()
    groupsdf = pd.DataFrame.from_dict(groups, orient='index')
    groupsdf.reset_index(inplace=True)
    groupsdf.columns = ['group', '#events']
    groupsdf = groupsdf.sort_values(by=['#events'],ascending=False)
    test = groupsdf[groupsdf['#events'] > 500]
    group_list = list(test["group"])
    e4 = e3[e3.gname.isin(group_list)]
    return e4

e4 = remove_lt_500() 
print len(e4.gname.unique())
23

Data preparation : Impute missing values

In [8]:
def impute_missing():
        #Numeric columns
        #nkill 4331 missing values
        #nwound 6890 missing values

        e4['nkill'] = e4.groupby("gname").nkill.transform(lambda x:x.fillna(x.median()))
        e4['nwound'] = e4.groupby("gname").nwound.transform(lambda x:x.fillna(x.median()))
        e4['nwoundte'] = e4.groupby("gname").nwoundte.transform(lambda x:x.fillna(x.median()))
        e4['nperps'] = e4['nperps'].replace(-99,np.nan)
        e4['nperps'] = e4['nperps'].replace(-9,np.nan)
        e4['nperps'] = e4.groupby("gname").nperps.transform(lambda x:x.fillna(x.median()))
        e4['nkillter'] = e4.groupby("gname").nkillter.transform(lambda x:x.fillna(x.median()))
        # Categorical variables
        # ishostkid -9 = unknown/missing
        e4['ishostkid'] = e4['ishostkid'].replace(-9,np.nan)
        e4['ishostkid'] = e4.groupby("gname").ishostkid.transform(lambda x:x.fillna(x.mode()[0]))
        
        e4['ransom'] = e4['ransom'].replace(-9,np.nan)
        e4['ransom'] = e4.groupby("gname").ransom.transform(lambda x:x.fillna(x.mode()[0]))
        # INT_variables: -9 = unknown/missing

        e4['INT_ANY'] = e4['INT_ANY'].replace(-9,np.nan)
        e4['INT_ANY'] = e4.groupby("gname").INT_ANY.transform(lambda x:x.fillna(x.mode()[0]))
        
        # weapsubtype1 = 13 = unknown/missing
        e4['weapsubtype1'] = e4['weapsubtype1'].replace(13,np.nan)
        e4['weapsubtype1'] = e4.groupby("gname").weapsubtype1.transform(lambda x:x.fillna(x.mode()[0]))

        # property = -9 = unknown/missing
        e4['property'] = e4['property'].replace(-9,np.nan)
        e4['property'] = e4.groupby("gname").property.transform(lambda x:x.fillna(x.mode()[0]))

        # targsubtype1 has 1963 missingvalues
        e4['targsubtype1'] = e4.groupby("gname").targsubtype1.transform(lambda x:x.fillna(x.mode()[0]))

        # natlty1 has 329 missingvalues
        e4['natlty1'] = e4.groupby("country").natlty1.transform(lambda x:x.fillna(x.mode()[0]))

        # attacktype1 9 = unknown/missing
        e4['attacktype1'] = e4['attacktype1'].replace(9,np.nan)
        e4['attacktype1'] = e4.groupby("gname").attacktype1.transform(lambda x:x.fillna(x.mode()[0]))

        #targtype1 20 = unknown/missing
        e4['targtype1'] = e4['targtype1'].replace(13,np.nan)
        e4['targtype1'] = e4.groupby("gname").targtype1.transform(lambda x:x.fillna(x.mode()[0]))
        return e4
e4 = impute_missing()   

Understanding Features

Variable Description Type
ransom 1 = "Yes" The incident involved a demand of monetary ransom; 0 = "No" The incident did not involve a demand of monetaryransom; -9 = "Unknown" It is unknown if the incident involved a demand ofmonetary ransom. Categorical
nwound This field records the number of confirmed non-fatal injuries to both perpetrators and victims Numeric
INT_ANY 1 = "Yes" The attack was international on any of the dimensionsdescribed above (logistically, ideologically, miscellaneous);0 = "No" The attack was domestic on all of the dimensions describedabove (logistically, ideologically, miscellaneous);-9 = "Unknown" It is unknown if the attack was international or domestic; thevalue for one or more dimensions is unknown. Categorical
provstate This variable records the name (at the time of event) of the 1st order subnationaladministrative region in which the event occurs. Text
multiple In those cases where several attacks are connected, but where the various actions do not constitute a single incident (either the time of occurrence of incidents or their locations are discontinuous – see Single Incident Determination section above), then “Yes” is selected to denote that the particular attack was part of a “multiple” incident. 1 = "Yes" The attack is part of a multiple incident. 0 = "No" The attack is not part of a multiple incident Categorical
gname Attack group Categorical
nwoundte Number of Perpetrators Injured Numeric
nkill This field stores the number of total confirmed fatalities for the incident. The number includes all victims and attackers who died as a direct result of the inciden Numeric
iyear This field contains the year in which the incident occurred. Categorical
vicinity 1 = "Yes" The incident occurred in the immediate vicinity of the city in question. 0 = "No" The incident in the city itself. Categorical
success 1 = "Yes" The incident was successful. 0 = "No" The incident was not successful Categorical
imonth This field contains the month in which the incident occurred. Categorical
nperps Number of Perpetrators
targsubtype1 The target subtype variable captures the more specific target category and providesthe next level of designation for each target type. Categorical
property “Yes” appears if there is evidence of property damage from the incident.1 = "Yes" The incident resulted in property damage.0 = "No" The incident did not result in property damage.-9 = "Unknown" It is unknown if the incident resulted in propertydamage Categorical
crit1 The violent act must be aimed at attaining a political, economic, religious, orsocial goal. This criterion is not satisfied in those cases where the perpetrator(s)acted out of a pure profit motive or from an idiosyncratic personal motiveunconnected with broader societal change.1 = "Yes" The incident meets Criterion 1.0 = "No" The incident does not meet Criterion 1 or there is no indicationthat the incident meets Criterion 1. Categorical
suicide This variable is coded “Yes” in those cases where there is evidence that theperpetrator did not intend to escape from the attack alive.1 = "Yes" The incident was a suicide attack.0 = "No" There is no indication that the incident was a suicide attack. Categorical
weaptype1 Up to four weapon types are recorded for each incident. Categorical
nkillter Number of Perpetrator Fatalities Numeric
natlty1 Nationality of Target/Victim Categorical
extended 1 = "Yes" The duration of an incident extended more than 24 hours.0 = "No" The duration of an incident extended less than 24 hours. Categorical
crit2 To satisfy this criterion there must be evidence of an intention to coerce,intimidate, or convey some other message to a larger audience (or audiences)than the immediate victims. Such evidence can include (but is not limited to) thefollowing: pre- or post-attack statements by the perpetrator(s), past behavior bythe perpetrators, or the particular nature of the target/victim, weapon, or attacktype.1 = "Yes" The incident meets Criterion 2.0 = "No" The incident does not meet Criterion 2 or no indication. Categorical
attacktype1 This field captures the general method of attack and often reflects the broad class oftactics used Categorical
weapsubtype1 This field records a more specific value for most of the Weapon Types Categorical
ishostkid This field records whether or not the victims were taken hostage (i.e. held againsttheir will) or kidnapped (i.e. held against their will and taken to another location)during an incident.1 = "Yes" The victims were taken hostage or kidnapped.0 = "No" The victims were not taken hostage or kidnapped.Hostages or Kidnapping Victims 51GLOBAL TERRORISM DATABASE CODEBOOK ©UNIVERSITY OF MARYLAND JUNE 2017-9 = "Unknown" It is unknown if the victims were taken hostage orkidnapped. Categorical
country country or location where the incident occurred Categorical
crit3 The action is outside the context of legitimate warfare activities, insofar as ittargets non-combatants (i.e. the act must be outside the parameters permittedby international humanitarian law as reflected in the Additional Protocol to theGeneva Conventions of 12 August 1949 and elsewhere).1 = "Yes" The incident meets Criterion 3.0 = "No" The incident does not meet Criterion 3 Categorical
targtype1 The target/victim type field captures the general type of target/victim. Categorical
region This field identifies the region in which the incident occurred Categorical

Exploratory data analysis

  1. Understand what categorical features influence the target_group
  2. Understand what numeric features influence the target_group

Profiling Categorical variables

In [16]:
%matplotlib inline

def data_prof1(col):     
    s = e4.groupby(["gname",col]).size().reset_index(name='counts')
    s[col] = s[col].astype('category')
    t = "#Attacks by"+" "+col
    p1 = (ggplot(s,aes(x='gname',y='counts',fill=col))+theme(figure_size=(11, 6), axis_text_x=element_text(rotation=90))+geom_col()+ labs(title=t)) 
    print p1   
    
list_feat = ["weaptype1","natlty1","property","region","success","targtype1",
             "suicide","iyear","country","attacktype1","INT_ANY","ishostkid","ransom"]
for col in list_feat: 
    data_prof1(col)  
   
<ggplot: (8739514195941)>
<ggplot: (8739509909665)>
<ggplot: (8739514249293)>
<ggplot: (8739508906769)>
<ggplot: (8739508701997)>
<ggplot: (8739508709005)>
<ggplot: (8739513882605)>
<ggplot: (8739508711685)>
<ggplot: (8739514232181)>
<ggplot: (8739508394269)>
<ggplot: (8739506573345)>
<ggplot: (8739514050341)>
<ggplot: (8739508500245)>

Profiling categorical variables:

  • Above I have plotted a heatmap of no of attacks for each target group groupedby categorical variables.
  • A quick look at the visuals will help us understand if any of the categorical variables influence the "target group"

Analysis:

  • For the categorical variable "natlty1":
    • We can see that some terrorist groups prefer attacking specific "nationalities".
    • For eg: In the chart, you can see that Talibans mostly prefer attacking nationality no's 999 and 603. 999 in this case refers to multinational groups.
    • Similarly the ISIL group prefer's attacking nationality no 95( majority green bars) which refers to Iraq.
  • For the categorical variable "region":
    • Clearly terrorist groups have specific "regions" of attack
    • For eg: In the chart, you can see that Talibans have only attacked region6 = South Asia
    • FMLN group have attacked only in region2 = Central America & Caribbean, and so forth.
  • For the categorical variable "attacktype1":
    • Almost all terrorist groups have a higher percentage of attack type "Bombing"
    • Groups like Taliban, ETA and IRA also have attacked by "Assasinations"
  • For the categorical variable "ransom":
    • It's a good idea to drop this column as most groups did not demand a ransom
  • For the categorical variable "INT_ANY":
    • INT_ANY = where the attack was international or domestic. Some terrorist groups have attacked internationally and some domestically. Since there seems to be some decent level of variation, it might be interesting to keep this feature.

Profiling numeric variables

In [112]:
def data_prof_numeric(col1,col2):     
    s = e4.groupby(["gname",col1]).agg({col2:"sum","country":"count"}).reset_index()
    s.columns = ["gname",col1,"# Attacks",col2]
    t = "#Attacks"+" "+"grouped by"+" "+col1+""+"-"+""+"Heat mapping by"+" "+col2
    p1 = (ggplot(s,aes(x='gname',y='# Attacks',fill=col2))+theme(figure_size=(11, 6), axis_text_x=element_text(rotation=90))+geom_col()+ facet_wrap((col1,))+labs(title=t)) 
    print p1   

list_feat1 = ["weaptype1","attacktype1","region"]
list_feat2 = ["nkill","nwound","nkillter","nwoundte","nperps"]                 
for col1 in list_feat1: 
    for col2 in list_feat2:
        data_prof_numeric(col1,col2)  
<ggplot: (8751491739929)>
<ggplot: (8751491933181)>
<ggplot: (8751492466317)>
<ggplot: (8751491373577)>
<ggplot: (8751491419609)>
<ggplot: (8751491349225)>
<ggplot: (8751492252637)>
<ggplot: (8751492710309)>
<ggplot: (8751490928701)>
<ggplot: (8751492022033)>
<ggplot: (8751491223329)>
<ggplot: (8751492574301)>
<ggplot: (8751486479745)>
<ggplot: (8751492615081)>
<ggplot: (8751491075797)>

Profiling numeric variables:

  • The above charts are self-explanatory. Few points have been mentioned below.

Analysis:

  • Quickly scanning throught the charts, we see that most of the attacks have happened using very few weapon choices.
  • In the first chart, we see that ISIL has highest #attacks and highest #kills while using weapon type6. So maybe a combination of the two features "weapon type" and "nkill" could be helpful as a feature. </br>
  • Moving on to the chart "Attacks grouped by attacktype1 - heat mapping by nkillter
    • nkillter = perpetrator injuries
    • Most of attacks happened when the attack types were 1=assasination 2 = armed assault 3=bombing
    • Of which, few terrorist groups suffered higher% injuries as well </br>
  • Moving on to the chart "Attacks grouped by attacktype1 - heat mapping by nperps
    • nperps-no of terrorists participating in the attacks
    • For groups such as Ansar Allah and Taliban, the #terrorists participating is very high, again most of the concentration of the attacks is by attacktype = "bombing"

Most of the data is concentrated to limited terrorist groups and attacktypes and specific regions. However, we will proceed with this knowledge.

In [10]:
e4.columns
Out[10]:
Index([u'ransom', u'nwoundte', u'nperps', u'nkillter', u'weapsubtype1',
       u'nwound', u'nkill', u'targsubtype1', u'natlty1', u'ishostkid',
       u'INT_ANY', u'iyear', u'property', u'extended', u'country', u'region',
       u'provstate', u'vicinity', u'crit1', u'crit2', u'crit3', u'multiple',
       u'success', u'suicide', u'attacktype1', u'targtype1', u'gname',
       u'imonth', u'weaptype1'],
      dtype='object')

Converting categorical variables to dummy variables

In [19]:
data = e4
cat_vars=['ransom','weapsubtype1','targsubtype1','natlty1','ishostkid','INT_ANY','iyear','property',
          'extended','country','region','multiple','success','suicide','attacktype1','targtype1','imonth',
          'weaptype1']
for var in cat_vars:
    cat_list='var'+'_'+var
    cat_list = pd.get_dummies(e4[var], prefix=var)
    data1=data.join(cat_list)
    data=data1
data_vars=data.columns.values.tolist()
to_keep=[i for i in data_vars if i not in cat_vars]

data_final=data[to_keep]

#Dropping following variables      
to_drop1 = ['ransom', 'weapsubtype1', 'targsubtype1', 'natlty1', 'ishostkid',
       'INT_ANY', 'iyear', 'property', 'extended', 'country', 'region',
       'provstate', 'vicinity', 'crit1', 'crit2', 'crit3',
       'multiple', 'success', 'suicide', 'attacktype1', 'targtype1', 'imonth', 'weaptype1']
data.drop(to_drop1, inplace=True, axis=1)

Divide the data into training and testing

  • Before we divide our data intro training and testing, we will use only 50% of our original data for model training and validation. This is only for the purpose of my convenience to optimize on the run time of certain algorithms.

  • We will use sklearn function to split the training data in two datasets; 80/20 split. This is important, so we don't overfit our model. Meaning, the algorithm is so specific to a given subset, it cannot accurately generalize another subset, from the same dataset. It's important our algorithm has not seen the subset we will use to test, so it doesn't "cheat" by memorizing the answers. We will use sklearn's train_test_split function

In [22]:
print data_subset.shape
(19372, 443)
In [58]:
# Training and testing
seed = 10
data_subset = data.sample(frac=0.5, replace=True)

x = data_subset.drop(['gname'], axis=1)
y = data_subset['gname']

#Train test split
seed = 7
x_train, x_test, y_train, y_test = train_test_split(x, y, 
                                                    test_size=0.2, 
                                                    random_state=42)
print data_subset.shape
(19372, 443)

Data Modeling and Validation

Neural networks

  • In terms of data cleaning, all the features have been scaled, the target variable is one-hot encoded before model training.
  • The model is built on the training dataset and tested on the testing dataset.
  • The first three models are neural network classifiers with two hidden layers.
  • Based on trial and error experiments the number of dense layers and the activation functions have been set.
  • I have used two "sigmoid activation" functions for the middle layers and a "softmax" layer as the output activation.
  • 3 neural network models have been built with three optimizer functions rmsprop,adadelta and stochastic gradient descent, of which the rmsprop optimizer provides the best results.

Neural networks : Data Preparation

In [59]:
from sklearn.preprocessing import LabelBinarizer

#one-hot encoding
encoder = LabelBinarizer()
Y = encoder.fit_transform(y_train)
Y_test = encoder.fit_transform(y_test)

#Scaling features
X = StandardScaler().fit(x_train).transform(x_train)
X_test = StandardScaler().fit(x_test).transform(x_test)

print list(encoder.inverse_transform(Y_test[0:5]))
['Islamic State of Iraq and the Levant (ISIL)', 'Al-Shabaab', 'Irish Republican Army (IRA)', 'Taliban', 'Communist Party of India - Maoist (CPI-Maoist)']

Neural networks : Model training using Keras

In [61]:
from keras.models import Sequential #Sequential Models
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD #Stochastic Gradient Descent Optimizer

def create_model(opt):
    model = Sequential()
    model.add(Dense(1024, input_dim=442))
    model.add(Dropout(0.2))
    model.add(Activation('sigmoid'))
    model.add(Dense(512))
    model.add(Dropout(0.3))
    model.add(Activation('sigmoid'))
    model.add(Dense(23))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy',optimizer=opt,metrics=['accuracy'])
    return model

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model1 = create_model(sgd)
model2 = create_model("rmsprop")
model3 = create_model("adadelta")

history1 = model1.fit(X, Y, validation_data=(X_test,Y_test), nb_epoch=50, batch_size=2000)
history2 = model2.fit(X, Y, validation_data=(X_test,Y_test), nb_epoch=50, batch_size=2000)
history3 = model3.fit(X, Y, validation_data=(X_test,Y_test), nb_epoch=50, batch_size=2000)
Train on 15497 samples, validate on 3875 samples
Epoch 1/50
15497/15497 [==============================] - 4s 288us/step - loss: 3.0534 - acc: 0.1102 - val_loss: 2.9092 - val_acc: 0.1489
Epoch 2/50
15497/15497 [==============================] - 3s 201us/step - loss: 2.9280 - acc: 0.1491 - val_loss: 2.8814 - val_acc: 0.1486
Epoch 3/50
15497/15497 [==============================] - 3s 202us/step - loss: 2.8850 - acc: 0.1531 - val_loss: 2.8392 - val_acc: 0.1559
Epoch 4/50
15497/15497 [==============================] - 3s 225us/step - loss: 2.8477 - acc: 0.1660 - val_loss: 2.8027 - val_acc: 0.1515
Epoch 5/50
15497/15497 [==============================] - 4s 256us/step - loss: 2.8164 - acc: 0.1783 - val_loss: 2.7674 - val_acc: 0.1512
Epoch 6/50
15497/15497 [==============================] - 4s 237us/step - loss: 2.7844 - acc: 0.1860 - val_loss: 2.7337 - val_acc: 0.1907
Epoch 7/50
15497/15497 [==============================] - 4s 230us/step - loss: 2.7498 - acc: 0.2020 - val_loss: 2.6990 - val_acc: 0.2844
Epoch 8/50
15497/15497 [==============================] - 4s 239us/step - loss: 2.7164 - acc: 0.2463 - val_loss: 2.6599 - val_acc: 0.2937
Epoch 9/50
15497/15497 [==============================] - 4s 242us/step - loss: 2.6776 - acc: 0.2799 - val_loss: 2.6214 - val_acc: 0.2694
Epoch 10/50
15497/15497 [==============================] - 4s 287us/step - loss: 2.6410 - acc: 0.2833 - val_loss: 2.5792 - val_acc: 0.3363
Epoch 11/50
15497/15497 [==============================] - 4s 277us/step - loss: 2.5999 - acc: 0.3177 - val_loss: 2.5357 - val_acc: 0.3365
Epoch 12/50
15497/15497 [==============================] - 4s 252us/step - loss: 2.5559 - acc: 0.3241 - val_loss: 2.4903 - val_acc: 0.3468
Epoch 13/50
15497/15497 [==============================] - 4s 263us/step - loss: 2.5111 - acc: 0.3435 - val_loss: 2.4413 - val_acc: 0.3677
Epoch 14/50
15497/15497 [==============================] - 5s 322us/step - loss: 2.4604 - acc: 0.3518 - val_loss: 2.3891 - val_acc: 0.3814
Epoch 15/50
15497/15497 [==============================] - 5s 339us/step - loss: 2.4106 - acc: 0.3709 - val_loss: 2.3342 - val_acc: 0.3804
Epoch 16/50
15497/15497 [==============================] - 6s 356us/step - loss: 2.3565 - acc: 0.3772 - val_loss: 2.2767 - val_acc: 0.4023
Epoch 17/50
15497/15497 [==============================] - 5s 339us/step - loss: 2.2985 - acc: 0.3960 - val_loss: 2.2159 - val_acc: 0.4090
Epoch 18/50
15497/15497 [==============================] - 5s 338us/step - loss: 2.2397 - acc: 0.4089 - val_loss: 2.1520 - val_acc: 0.4444
Epoch 19/50
15497/15497 [==============================] - 5s 298us/step - loss: 2.1758 - acc: 0.4459 - val_loss: 2.0865 - val_acc: 0.4805
Epoch 20/50
15497/15497 [==============================] - 5s 338us/step - loss: 2.1116 - acc: 0.4796 - val_loss: 2.0181 - val_acc: 0.5533
Epoch 21/50
15497/15497 [==============================] - 4s 284us/step - loss: 2.0461 - acc: 0.5457 - val_loss: 1.9490 - val_acc: 0.5721
Epoch 22/50
15497/15497 [==============================] - 5s 294us/step - loss: 1.9794 - acc: 0.5655 - val_loss: 1.8790 - val_acc: 0.6310
Epoch 23/50
15497/15497 [==============================] - 4s 281us/step - loss: 1.9106 - acc: 0.6124 - val_loss: 1.8090 - val_acc: 0.6503
Epoch 24/50
15497/15497 [==============================] - 5s 308us/step - loss: 1.8429 - acc: 0.6435 - val_loss: 1.7391 - val_acc: 0.6779
Epoch 25/50
15497/15497 [==============================] - 5s 339us/step - loss: 1.7763 - acc: 0.6668 - val_loss: 1.6702 - val_acc: 0.6926
Epoch 26/50
15497/15497 [==============================] - 5s 335us/step - loss: 1.7091 - acc: 0.6845 - val_loss: 1.6026 - val_acc: 0.7190
Epoch 27/50
15497/15497 [==============================] - 4s 282us/step - loss: 1.6433 - acc: 0.7057 - val_loss: 1.5364 - val_acc: 0.7303
Epoch 28/50
15497/15497 [==============================] - 5s 301us/step - loss: 1.5800 - acc: 0.7177 - val_loss: 1.4718 - val_acc: 0.7481
Epoch 29/50
15497/15497 [==============================] - 5s 352us/step - loss: 1.5187 - acc: 0.7335 - val_loss: 1.4097 - val_acc: 0.7605
Epoch 30/50
15497/15497 [==============================] - 6s 372us/step - loss: 1.4599 - acc: 0.7503 - val_loss: 1.3491 - val_acc: 0.7775
Epoch 31/50
15497/15497 [==============================] - 5s 334us/step - loss: 1.4002 - acc: 0.7629 - val_loss: 1.2909 - val_acc: 0.7894
Epoch 32/50
15497/15497 [==============================] - 5s 323us/step - loss: 1.3449 - acc: 0.7744 - val_loss: 1.2350 - val_acc: 0.8034
Epoch 33/50
15497/15497 [==============================] - 5s 312us/step - loss: 1.2902 - acc: 0.7868 - val_loss: 1.1808 - val_acc: 0.8119
Epoch 34/50
15497/15497 [==============================] - 5s 317us/step - loss: 1.2367 - acc: 0.7980 - val_loss: 1.1287 - val_acc: 0.8191
Epoch 35/50
15497/15497 [==============================] - 5s 338us/step - loss: 1.1862 - acc: 0.8069 - val_loss: 1.0784 - val_acc: 0.8255
Epoch 36/50
15497/15497 [==============================] - 6s 362us/step - loss: 1.1373 - acc: 0.8156 - val_loss: 1.0311 - val_acc: 0.8325
Epoch 37/50
15497/15497 [==============================] - 5s 341us/step - loss: 1.0917 - acc: 0.8244 - val_loss: 0.9854 - val_acc: 0.8374
Epoch 38/50
15497/15497 [==============================] - 5s 315us/step - loss: 1.0462 - acc: 0.8313 - val_loss: 0.9417 - val_acc: 0.8431
Epoch 39/50
15497/15497 [==============================] - 5s 348us/step - loss: 1.0044 - acc: 0.8403 - val_loss: 0.9006 - val_acc: 0.8526
Epoch 40/50
15497/15497 [==============================] - 5s 344us/step - loss: 0.9613 - acc: 0.8462 - val_loss: 0.8610 - val_acc: 0.8606
Epoch 41/50
15497/15497 [==============================] - 5s 349us/step - loss: 0.9231 - acc: 0.8558 - val_loss: 0.8236 - val_acc: 0.8697
Epoch 42/50
15497/15497 [==============================] - 5s 340us/step - loss: 0.8866 - acc: 0.8602 - val_loss: 0.7888 - val_acc: 0.8735
Epoch 43/50
15497/15497 [==============================] - 5s 317us/step - loss: 0.8513 - acc: 0.8690 - val_loss: 0.7556 - val_acc: 0.8818
Epoch 44/50
15497/15497 [==============================] - 5s 315us/step - loss: 0.8172 - acc: 0.8762 - val_loss: 0.7238 - val_acc: 0.8834
Epoch 45/50
15497/15497 [==============================] - 5s 353us/step - loss: 0.7871 - acc: 0.8778 - val_loss: 0.6947 - val_acc: 0.8877
Epoch 46/50
15497/15497 [==============================] - 6s 407us/step - loss: 0.7568 - acc: 0.8831 - val_loss: 0.6670 - val_acc: 0.8939
Epoch 47/50
15497/15497 [==============================] - 6s 358us/step - loss: 0.7281 - acc: 0.8871 - val_loss: 0.6404 - val_acc: 0.8942
Epoch 48/50
15497/15497 [==============================] - 6s 367us/step - loss: 0.7018 - acc: 0.8889 - val_loss: 0.6160 - val_acc: 0.8950
Epoch 49/50
15497/15497 [==============================] - 5s 351us/step - loss: 0.6758 - acc: 0.8913 - val_loss: 0.5929 - val_acc: 0.8986
Epoch 50/50
15497/15497 [==============================] - 5s 338us/step - loss: 0.6520 - acc: 0.8955 - val_loss: 0.5715 - val_acc: 0.9017
Train on 15497 samples, validate on 3875 samples
Epoch 1/50
15497/15497 [==============================] - 6s 408us/step - loss: 3.0430 - acc: 0.1425 - val_loss: 2.3898 - val_acc: 0.3406
Epoch 2/50
15497/15497 [==============================] - 6s 366us/step - loss: 2.3566 - acc: 0.4054 - val_loss: 1.9219 - val_acc: 0.6612
Epoch 3/50
15497/15497 [==============================] - 5s 340us/step - loss: 1.7711 - acc: 0.5717 - val_loss: 1.2138 - val_acc: 0.7099
Epoch 4/50
15497/15497 [==============================] - 5s 338us/step - loss: 1.2182 - acc: 0.7575 - val_loss: 0.8266 - val_acc: 0.8297
Epoch 5/50
15497/15497 [==============================] - 5s 350us/step - loss: 0.8059 - acc: 0.8760 - val_loss: 0.5425 - val_acc: 0.9074
Epoch 6/50
15497/15497 [==============================] - 5s 332us/step - loss: 0.5772 - acc: 0.9029 - val_loss: 0.4136 - val_acc: 0.9032
Epoch 7/50
15497/15497 [==============================] - 5s 334us/step - loss: 0.4177 - acc: 0.9107 - val_loss: 0.3033 - val_acc: 0.9133
Epoch 8/50
15497/15497 [==============================] - 5s 293us/step - loss: 0.3216 - acc: 0.9244 - val_loss: 0.2708 - val_acc: 0.8975
Epoch 9/50
15497/15497 [==============================] - 7s 443us/step - loss: 0.2797 - acc: 0.9149 - val_loss: 0.2057 - val_acc: 0.9378
Epoch 10/50
15497/15497 [==============================] - 5s 310us/step - loss: 0.2163 - acc: 0.9379 - val_loss: 0.2042 - val_acc: 0.9156
Epoch 11/50
15497/15497 [==============================] - 5s 303us/step - loss: 0.1945 - acc: 0.9404 - val_loss: 0.1650 - val_acc: 0.9417
Epoch 12/50
15497/15497 [==============================] - 5s 321us/step - loss: 0.1667 - acc: 0.9482 - val_loss: 0.1822 - val_acc: 0.9445
Epoch 13/50
15497/15497 [==============================] - 5s 348us/step - loss: 0.1491 - acc: 0.9524 - val_loss: 0.1854 - val_acc: 0.9316
Epoch 14/50
15497/15497 [==============================] - 5s 325us/step - loss: 0.1397 - acc: 0.9568 - val_loss: 0.1291 - val_acc: 0.9546
Epoch 15/50
15497/15497 [==============================] - 5s 334us/step - loss: 0.1238 - acc: 0.9608 - val_loss: 0.1393 - val_acc: 0.9484
Epoch 16/50
15497/15497 [==============================] - 5s 304us/step - loss: 0.1132 - acc: 0.9599 - val_loss: 0.1342 - val_acc: 0.9582
Epoch 17/50
15497/15497 [==============================] - 4s 279us/step - loss: 0.1046 - acc: 0.9653 - val_loss: 0.1163 - val_acc: 0.9592
Epoch 18/50
15497/15497 [==============================] - 11s 708us/step - loss: 0.1096 - acc: 0.9611 - val_loss: 0.1329 - val_acc: 0.9494
Epoch 19/50
15497/15497 [==============================] - 11s 686us/step - loss: 0.1015 - acc: 0.9610 - val_loss: 0.1241 - val_acc: 0.9556
Epoch 20/50
15497/15497 [==============================] - 11s 687us/step - loss: 0.0878 - acc: 0.9681 - val_loss: 0.1117 - val_acc: 0.9605
Epoch 21/50
15497/15497 [==============================] - 11s 689us/step - loss: 0.0814 - acc: 0.9714 - val_loss: 0.1234 - val_acc: 0.9507
Epoch 22/50
15497/15497 [==============================] - 11s 693us/step - loss: 0.0953 - acc: 0.9635 - val_loss: 0.1368 - val_acc: 0.9489
Epoch 23/50
15497/15497 [==============================] - 11s 688us/step - loss: 0.0822 - acc: 0.9690 - val_loss: 0.1358 - val_acc: 0.9600
Epoch 24/50
15497/15497 [==============================] - 11s 689us/step - loss: 0.0775 - acc: 0.9710 - val_loss: 0.1111 - val_acc: 0.9621
Epoch 25/50
15497/15497 [==============================] - 11s 690us/step - loss: 0.0812 - acc: 0.9686 - val_loss: 0.1150 - val_acc: 0.9585
Epoch 26/50
15497/15497 [==============================] - 11s 688us/step - loss: 0.0742 - acc: 0.9727 - val_loss: 0.1171 - val_acc: 0.9572
Epoch 27/50
15497/15497 [==============================] - 11s 708us/step - loss: 0.0771 - acc: 0.9713 - val_loss: 0.1142 - val_acc: 0.9639
Epoch 28/50
15497/15497 [==============================] - 11s 732us/step - loss: 0.0696 - acc: 0.9731 - val_loss: 0.1183 - val_acc: 0.9631
Epoch 29/50
15497/15497 [==============================] - 11s 689us/step - loss: 0.0754 - acc: 0.9723 - val_loss: 0.1238 - val_acc: 0.9600
Epoch 30/50
15497/15497 [==============================] - 11s 689us/step - loss: 0.0653 - acc: 0.9745 - val_loss: 0.1195 - val_acc: 0.9569
Epoch 31/50
15497/15497 [==============================] - 11s 723us/step - loss: 0.0763 - acc: 0.9706 - val_loss: 0.1251 - val_acc: 0.9554
Epoch 32/50
15497/15497 [==============================] - 11s 690us/step - loss: 0.0670 - acc: 0.9730 - val_loss: 0.1139 - val_acc: 0.9600
Epoch 33/50
15497/15497 [==============================] - 11s 690us/step - loss: 0.0645 - acc: 0.9754 - val_loss: 0.1141 - val_acc: 0.9652
Epoch 34/50
15497/15497 [==============================] - 11s 690us/step - loss: 0.0668 - acc: 0.9734 - val_loss: 0.1135 - val_acc: 0.9626
Epoch 35/50
15497/15497 [==============================] - 11s 693us/step - loss: 0.0597 - acc: 0.9769 - val_loss: 0.1140 - val_acc: 0.9615
Epoch 36/50
15497/15497 [==============================] - 11s 691us/step - loss: 0.0615 - acc: 0.9759 - val_loss: 0.1204 - val_acc: 0.9582
Epoch 37/50
15497/15497 [==============================] - 11s 694us/step - loss: 0.0658 - acc: 0.9752 - val_loss: 0.1089 - val_acc: 0.9657
Epoch 38/50
15497/15497 [==============================] - 11s 693us/step - loss: 0.0559 - acc: 0.9785 - val_loss: 0.1227 - val_acc: 0.9566
Epoch 39/50
15497/15497 [==============================] - 11s 690us/step - loss: 0.0581 - acc: 0.9783 - val_loss: 0.1189 - val_acc: 0.9615
Epoch 40/50
15497/15497 [==============================] - 11s 691us/step - loss: 0.0628 - acc: 0.9749 - val_loss: 0.1105 - val_acc: 0.9683
Epoch 41/50
15497/15497 [==============================] - 11s 692us/step - loss: 0.0597 - acc: 0.9757 - val_loss: 0.1188 - val_acc: 0.9659
Epoch 42/50
15497/15497 [==============================] - 11s 689us/step - loss: 0.0519 - acc: 0.9803 - val_loss: 0.1254 - val_acc: 0.9646
Epoch 43/50
15497/15497 [==============================] - 11s 691us/step - loss: 0.0570 - acc: 0.9777 - val_loss: 0.1244 - val_acc: 0.9569
Epoch 44/50
15497/15497 [==============================] - 11s 692us/step - loss: 0.0590 - acc: 0.9774 - val_loss: 0.1231 - val_acc: 0.9605
Epoch 45/50
15497/15497 [==============================] - 11s 692us/step - loss: 0.0498 - acc: 0.9812 - val_loss: 0.1343 - val_acc: 0.9634
Epoch 46/50
15497/15497 [==============================] - 11s 689us/step - loss: 0.0563 - acc: 0.9787 - val_loss: 0.1200 - val_acc: 0.9654
Epoch 47/50
15497/15497 [==============================] - 11s 692us/step - loss: 0.0479 - acc: 0.9823 - val_loss: 0.1341 - val_acc: 0.9639
Epoch 48/50
15497/15497 [==============================] - 11s 689us/step - loss: 0.0508 - acc: 0.9810 - val_loss: 0.1301 - val_acc: 0.9623
Epoch 49/50
15497/15497 [==============================] - 11s 696us/step - loss: 0.0536 - acc: 0.9795 - val_loss: 0.1220 - val_acc: 0.9618
Epoch 50/50
15497/15497 [==============================] - 11s 690us/step - loss: 0.0487 - acc: 0.9817 - val_loss: 0.1214 - val_acc: 0.9639
Train on 15497 samples, validate on 3875 samples
Epoch 1/50
15497/15497 [==============================] - 13s 849us/step - loss: 3.0219 - acc: 0.1212 - val_loss: 2.8154 - val_acc: 0.1004
Epoch 2/50
15497/15497 [==============================] - 11s 702us/step - loss: 2.7677 - acc: 0.1598 - val_loss: 2.5503 - val_acc: 0.2057
Epoch 3/50
15497/15497 [==============================] - 11s 702us/step - loss: 2.5482 - acc: 0.2240 - val_loss: 2.2453 - val_acc: 0.3693
Epoch 4/50
15497/15497 [==============================] - 11s 702us/step - loss: 2.2539 - acc: 0.3802 - val_loss: 2.0233 - val_acc: 0.3481
Epoch 5/50
15497/15497 [==============================] - 11s 700us/step - loss: 1.9935 - acc: 0.4354 - val_loss: 1.8653 - val_acc: 0.4462
Epoch 6/50
15497/15497 [==============================] - 11s 705us/step - loss: 1.6724 - acc: 0.5931 - val_loss: 1.4369 - val_acc: 0.7548
Epoch 7/50
15497/15497 [==============================] - 11s 700us/step - loss: 1.2989 - acc: 0.7504 - val_loss: 1.0440 - val_acc: 0.8733
Epoch 8/50
15497/15497 [==============================] - 11s 699us/step - loss: 1.0416 - acc: 0.8061 - val_loss: 0.7464 - val_acc: 0.8957
Epoch 9/50
15497/15497 [==============================] - 11s 725us/step - loss: 0.7741 - acc: 0.8919 - val_loss: 0.5922 - val_acc: 0.9205
Epoch 10/50
15497/15497 [==============================] - 11s 703us/step - loss: 0.6346 - acc: 0.9073 - val_loss: 0.4782 - val_acc: 0.9130
Epoch 11/50
15497/15497 [==============================] - 11s 701us/step - loss: 0.5340 - acc: 0.9104 - val_loss: 0.4076 - val_acc: 0.8999
Epoch 12/50
15497/15497 [==============================] - 11s 702us/step - loss: 0.4592 - acc: 0.9142 - val_loss: 0.3525 - val_acc: 0.9182
Epoch 13/50
15497/15497 [==============================] - 11s 699us/step - loss: 0.4027 - acc: 0.9213 - val_loss: 0.3139 - val_acc: 0.9203
Epoch 14/50
15497/15497 [==============================] - 12s 766us/step - loss: 0.3626 - acc: 0.9188 - val_loss: 0.2945 - val_acc: 0.9254
Epoch 15/50
15497/15497 [==============================] - 11s 726us/step - loss: 0.3303 - acc: 0.9250 - val_loss: 0.2617 - val_acc: 0.9295
Epoch 16/50
15497/15497 [==============================] - 11s 702us/step - loss: 0.3005 - acc: 0.9290 - val_loss: 0.2467 - val_acc: 0.9303
Epoch 17/50
15497/15497 [==============================] - 13s 840us/step - loss: 0.2803 - acc: 0.9299 - val_loss: 0.2419 - val_acc: 0.9133
Epoch 18/50
15497/15497 [==============================] - 12s 777us/step - loss: 0.2583 - acc: 0.9311 - val_loss: 0.2154 - val_acc: 0.9365
Epoch 19/50
15497/15497 [==============================] - 12s 768us/step - loss: 0.2454 - acc: 0.9353 - val_loss: 0.2099 - val_acc: 0.9339
Epoch 20/50
15497/15497 [==============================] - 12s 767us/step - loss: 0.2310 - acc: 0.9395 - val_loss: 0.1959 - val_acc: 0.9435
Epoch 21/50
15497/15497 [==============================] - 13s 824us/step - loss: 0.2141 - acc: 0.9433 - val_loss: 0.1905 - val_acc: 0.9440
Epoch 22/50
15497/15497 [==============================] - 13s 840us/step - loss: 0.2038 - acc: 0.9450 - val_loss: 0.1812 - val_acc: 0.9466
Epoch 23/50
15497/15497 [==============================] - 11s 726us/step - loss: 0.1983 - acc: 0.9450 - val_loss: 0.1755 - val_acc: 0.9492
Epoch 24/50
15497/15497 [==============================] - 11s 726us/step - loss: 0.1856 - acc: 0.9497 - val_loss: 0.1713 - val_acc: 0.9471
Epoch 25/50
15497/15497 [==============================] - 12s 794us/step - loss: 0.1785 - acc: 0.9513 - val_loss: 0.1691 - val_acc: 0.9448
Epoch 26/50
15497/15497 [==============================] - 12s 774us/step - loss: 0.1762 - acc: 0.9480 - val_loss: 0.1607 - val_acc: 0.9520
Epoch 27/50
15497/15497 [==============================] - 13s 844us/step - loss: 0.1633 - acc: 0.9557 - val_loss: 0.1601 - val_acc: 0.9528
Epoch 28/50
15497/15497 [==============================] - 12s 756us/step - loss: 0.1569 - acc: 0.9558 - val_loss: 0.1550 - val_acc: 0.9512
Epoch 29/50
15497/15497 [==============================] - 12s 802us/step - loss: 0.1546 - acc: 0.9524 - val_loss: 0.1541 - val_acc: 0.9541
Epoch 30/50
15497/15497 [==============================] - 13s 816us/step - loss: 0.1472 - acc: 0.9568 - val_loss: 0.1477 - val_acc: 0.9546
Epoch 31/50
15497/15497 [==============================] - 13s 812us/step - loss: 0.1423 - acc: 0.9593 - val_loss: 0.1466 - val_acc: 0.9535
Epoch 32/50
15497/15497 [==============================] - 13s 827us/step - loss: 0.1383 - acc: 0.9586 - val_loss: 0.1395 - val_acc: 0.9546
Epoch 33/50
15497/15497 [==============================] - 12s 772us/step - loss: 0.1333 - acc: 0.9597 - val_loss: 0.1454 - val_acc: 0.9427
Epoch 34/50
15497/15497 [==============================] - 12s 784us/step - loss: 0.1359 - acc: 0.9558 - val_loss: 0.1361 - val_acc: 0.9556
Epoch 35/50
15497/15497 [==============================] - 13s 832us/step - loss: 0.1291 - acc: 0.9590 - val_loss: 0.1492 - val_acc: 0.9368
Epoch 36/50
15497/15497 [==============================] - 12s 779us/step - loss: 0.1281 - acc: 0.9582 - val_loss: 0.1324 - val_acc: 0.9559
Epoch 37/50
15497/15497 [==============================] - 12s 761us/step - loss: 0.1250 - acc: 0.9598 - val_loss: 0.1310 - val_acc: 0.9566
Epoch 38/50
15497/15497 [==============================] - 13s 819us/step - loss: 0.1200 - acc: 0.9632 - val_loss: 0.1387 - val_acc: 0.9559
Epoch 39/50
15497/15497 [==============================] - 12s 774us/step - loss: 0.1189 - acc: 0.9604 - val_loss: 0.1308 - val_acc: 0.9572
Epoch 40/50
15497/15497 [==============================] - 12s 773us/step - loss: 0.1183 - acc: 0.9595 - val_loss: 0.1278 - val_acc: 0.9556
Epoch 41/50
15497/15497 [==============================] - 12s 771us/step - loss: 0.1137 - acc: 0.9637 - val_loss: 0.1337 - val_acc: 0.9471
Epoch 42/50
15497/15497 [==============================] - 12s 767us/step - loss: 0.1119 - acc: 0.9649 - val_loss: 0.1242 - val_acc: 0.9579
Epoch 43/50
15497/15497 [==============================] - 12s 777us/step - loss: 0.1079 - acc: 0.9655 - val_loss: 0.1242 - val_acc: 0.9582
Epoch 44/50
15497/15497 [==============================] - 12s 779us/step - loss: 0.1069 - acc: 0.9665 - val_loss: 0.1231 - val_acc: 0.9574
Epoch 45/50
15497/15497 [==============================] - 12s 777us/step - loss: 0.1071 - acc: 0.9655 - val_loss: 0.1212 - val_acc: 0.9577
Epoch 46/50
15497/15497 [==============================] - 12s 795us/step - loss: 0.1114 - acc: 0.9603 - val_loss: 0.1241 - val_acc: 0.9528
Epoch 47/50
15497/15497 [==============================] - 13s 844us/step - loss: 0.1085 - acc: 0.9604 - val_loss: 0.1187 - val_acc: 0.9590
Epoch 48/50
15497/15497 [==============================] - 13s 813us/step - loss: 0.1024 - acc: 0.9655 - val_loss: 0.1206 - val_acc: 0.9561
Epoch 49/50
15497/15497 [==============================] - 12s 779us/step - loss: 0.1028 - acc: 0.9648 - val_loss: 0.1191 - val_acc: 0.9585
Epoch 50/50
15497/15497 [==============================] - 11s 717us/step - loss: 0.1111 - acc: 0.9586 - val_loss: 0.1234 - val_acc: 0.9533

Neural networks: Model Validation

In [64]:
#--------Plotting No of iterations vs. Categorical Cross entropy of a single neural network with optimiser = rmsprop
def plotting_cross_entropy():
    plt.plot(history2.history['loss'],'o-')
    plt.xlabel("Number of iterations")
    plt.ylabel("Categorical cross entropy")
    plt.title("Train error vs. no of iterations")

#--------Finding accuracy of the nn classifiers on test data
def nn_evaluation_train():
    print('The training accuracy of neural network sgd classifier is {:.2f}'.format(np.mean(history1.history['acc'])*100))
    print('The training accuracy of neural network rmsprop classifier is {:.2f}'.format(np.mean(history2.history['acc'])*100))
    print('The training accuracy of neural network adadelta classifier is {:.2f}'.format(np.mean(history3.history['acc'])*100))

def nn_evaluation_test():
    print('The validation accuracy of neural network sgd classifier is {:.2f}'.format(np.mean(history1.history['val_acc'])*100))
    print('The validation accuracy of neural network rmsprop classifier is {:.2f}'.format(np.mean(history2.history['val_acc'])*100))
    print('The validation accuracy of neural network adadelta classifier is {:.2f}'.format(np.mean(history3.history['val_acc'])*100))


#print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
#scores = model1.evaluate(X_test, Y_test, verbose=0)

#-------Extracting the class of the target variable post the predict operation
def class_extraction():
    y_prob_model1 = model1.predict_proba(X_test)
    y_class_model1 = y_prob_model1.argmax(axis=-1)

    y_prob_model2 = model2.predict_proba(X_test)
    y_class_model2 = y_prob_model2.argmax(axis=-1)

    y_prob_model3 = model3.predict_proba(X_test)
    y_class_model3 = y_prob_model3.argmax(axis=-1)

class_extraction()
nn_evaluation_train()
nn_evaluation_test()
plotting_cross_entropy()

    
The training accuracy of neural network sgd classifier is 58.53
The training accuracy of neural network rmsprop classifier is 92.35
The training accuracy of neural network adadelta classifier is 86.51
The validation accuracy of neural network sgd classifier is 60.70
The validation accuracy of neural network rmsprop classifier is 92.69
The validation accuracy of neural network adadelta classifier is 87.31

Xgboost - Model training and validation

  • XGBoost is an implementation of Gradient Boosted Decision Trees algorithm.
  • This algorithm goes through cycles that repeatedly builds new models and combines them into an ensemble model.
  • Each cycle tries to minimise the error of the previous model predictions.

In this section , I have been a simple xgboost model that provides us very good accuracy on the validation dataset.

In [33]:
from sklearn.preprocessing import LabelEncoder

#Encode the target variable
label_encoder = LabelEncoder()
label_encoder_train = label_encoder.fit(y_train)
Y_train = label_encoder_train.transform(y_train)

label_encoder_test = label_encoder.fit(y_test)
Y_test = label_encoder_test.transform(y_test)

#Scaling the features
X_train = StandardScaler().fit(x_train).transform(x_train)
X_test = StandardScaler().fit(x_test).transform(x_test)

#Training the model
xgbc = xgb.XGBClassifier(seed=42, n_estimators=50, early_stopping_rounds = 5, learning_rate=0.05)
xgbc.fit(X_train, Y_train)

#Evaluate on test data
def evaluate_xg():
    preds_xgb = xgbc.predict(X_test)
    print('Accuracy of Xgboost classifier is {}'.format(accuracy_score(y_tesY, preds_xgb)*100))

evaluate_xg()
Accuracy of Xgboost classifier is0.989677419355

RandomForest: Model Training and validation

In this section I have used randomforest models for training the data. This is another ensemble based model that can provide good results.

In [34]:
#Preparing training and testing data
factor = pd.factorize(data_subset['gname'])
data_subset.gname = factor[0]
definitions = factor[1]

x = data_subset.drop(['gname'], axis=1)
y = data_subset['gname']
x_train, x_test, y_train, y_test = train_test_split(x, y,test_size=0.1, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(x_train)
X_test = scaler.transform(x_test)

rfrst = RandomForestClassifier(n_jobs=4, n_estimators=50, max_features='sqrt', criterion = 'entropy', random_state=42)
rfrst.fit(X_train, y_train)
preds_rfrst = rfrst.predict(X_test)
print('Accuracy of Rf model on validation dataset is : {}'.format(accuracy_score(y_test, preds_rfrst)))
Accuracy of Rf model on validation dataset is : 0.995356037152

Knn classifier - Model Training and validation

In [40]:
#knn classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3, p=2, metric='minkowski')
knn.fit(X_train, y_train)

print('The accuracy of the knn classifier is {:.2f} on training data'.format(knn.score(X_train, y_train)))
print('The accuracy of the knn classifier is {:.2f} on test data'.format(knn.score(X_test, y_test)))
The accuracy of the knn classifier is 0.95 on training data
The accuracy of the knn classifier is 0.89 on test data

Adaboost - Model Training and validation

Another boosting technique useful for multi-class predictions.

In [81]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier

#Adaboost classifier
dt = DecisionTreeClassifier() 
model = AdaBoostClassifier(n_estimators=100,base_estimator=dt, random_state = 4)
model_ada = model.fit(X_train, y_train)
pred_ada = model_ada.predict(X_test)

print('Accuracy of adaboost model on validation dataset is : {}'.format(accuracy_score(y_test, pred_ada)))

#results = model_selection.cross_val_score(model_ada,X_test,y_test)
Accuracy of adaboost model on validation dataset is : 0.993121238177

Feature Importance

Having too many irrelavant features can sometimes decrease the accuracy of the models. Feature selection provides the following benefits:

  1. Reduces Overfitting
  2. Improves accuracy of the models
  3. Last but not the least, reduces training time!
In [57]:
#Encode the target variable
x = data_subset.drop(['gname'], axis=1)
y = data_subset["gname"]
label_encoder = LabelEncoder()
label_encoder_train = label_encoder.fit(y_train)
Y_train = label_encoder_train.transform(y_train)
label_encoder_test = label_encoder.fit(y_test)
Y_test = label_encoder_test.transform(y_test)

#Scaling the features
X_train = StandardScaler().fit(x_train).transform(x_train)
X_test = StandardScaler().fit(x_test).transform(x_test)

#Feature Importance
model_feature_imp = ExtraTreesClassifier(n_estimators=250,random_state=0)
model_feature_imp.fit(X_train, Y_train)

list_feat = {}
for feature, importance in zip(x.columns, model_feature_imp.feature_importances_):
    list_feat[feature] = importance

importances = pd.DataFrame.from_dict(list_feat, orient='index').rename(columns={0: 'Gini-importance'})
importances.sort_values(by='Gini-importance').plot(kind='bar', rot=90)
table = importances.sort_values(by='Gini-importance',ascending= False)
table.head(50)
Out[57]:
Gini-importance
natlty1_4.0 0.047871
country_4 0.045315
country_159 0.036618
region_6 0.034445
INT_ANY_0.0 0.031268
region_3 0.031175
INT_ANY_1.0 0.026623
natlty1_159.0 0.025678
region_10 0.024060
country_95 0.022711
region_11 0.022358
natlty1_95.0 0.021578
country_603 0.020979
country_92 0.019530
country_45 0.019219
country_185 0.019084
natlty1_92.0 0.019065
region_8 0.018035
natlty1_45.0 0.017365
region_2 0.016971
country_61 0.016558
nperps 0.016478
natlty1_61.0 0.015800
country_147 0.015224
region_5 0.015067
natlty1_185.0 0.014920
natlty1_147.0 0.014375
natlty1_209.0 0.013994
country_182 0.013907
country_160 0.013255
country_209 0.012537
natlty1_153.0 0.012363
natlty1_160.0 0.012234
country_186 0.012005
natlty1_186.0 0.010980
country_153 0.010889
natlty1_182.0 0.010798
country_43 0.008688
country_228 0.008654
natlty1_228.0 0.008280
natlty1_43.0 0.007883
nkillter 0.007772
natlty1_233.0 0.007568
country_69 0.007428
weapsubtype1_16.0 0.004991
country_183 0.004874
natlty1_183.0 0.004853
natlty1_69.0 0.004631
weapsubtype1_5.0 0.004404
weapsubtype1_15.0 0.004298

Looks like the top features include region, nationality of incident place, weapon type, etc.

Model Performances

  • Following is a table that shows the accuracies of different models on the validaiton dataset.
  • We can clearly see that nn_sgd seems to be performing the worst. Rest of the models have an accuracy > 85%
Model Accuracy
nn_sgd 60%
nn_rmsprop 92%
nn_adadelta 87%
xgboost 98%
randomforest 99%
knn 89%
adaboost 99%

Next steps

Following are some other things that I haven't gotten to but extremely important to improve the accuracy of our models.

  1. From the feature importance table above, use only those features that are important to build new models
  2. Use K-fold cross validation methods to calculate accuracy of your predictions.
  3. Use the above models to build another ensemble model.